DATE: May 20, 2021
TO: Annette Demchur, Rebecca Morgan
FROM: Margaret Atkinson
RE: Staff-Initiated Study Proposal
I would like to conduct research to answer the following question: could an unsupervised machine learning algorithm create groups of towns, based on demographic information, that would be useful for exploring questions about equity? To explain: when we compare all towns (or TAZs, block groups, etc.) that pass the minority threshold to all towns that do not, we may be missing the ways the demographic variables interact. A multifactored grouping could let us explore town demographics with a more detailed approach, without the blending inherent in an index.
This project would use python and specifically the Scikit-Learn python library to conduct unsupervised machine learning based on demographic data at the town level. The data would be demographic Census data from the American Community Survey 5- year estimates at minimum on the topics of: Race/Ethnicity, Limited English Proficiency, Median Income, Low Income, No Car Households, Population Density, Children, and Seniors. The product would be a geographic file that shows groupings of towns by demographic profile as found by the unsupervised machine learning algorithm as well as a written description of what each grouping represents.
If the question is pursued and the results are useful, the ultimate intention (as a follow-up project) would be to look at the way the MPO distributes funding between the groups, and within each group, to look for disparities. Explanations of any disparities could lead to a re-examination of the variables used in the algorithm, providing an additional check for equitable spending.
After trying two different clustering methods (K-Means and DBSCAN) on the original data and on data representing the relationships between the original variables (principal components), I determined that the results of this study are not significant: the demographic variables chosen do not produce multiple clusters of high enough quality to consider using. Confidence in the clusters created is low; in truth, there appears to be one cluster plus some outliers that relate only minimally to each other. To determine whether this was an issue with the number of samples (towns), I applied the same methodology to the Census Tracts of the MPO. A similar result was achieved, even with the greater variation and quantity that the Tracts provide.
While this idea has not panned out in a useful way, I propose that the agency continue to think outside the box about how we conduct equity analyses, especially given the open-source advanced tools that we could be using.
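The two clustering passes described above (on the original variables and on their principal components) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the study's actual code; the standardization step, the parameter values, and the 97×9 shape are my assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN

# Synthetic stand-in for the town-level demographic table (97 towns x 9 variables);
# the real study pulls these values from the Census, below.
rng = np.random.default_rng(0)
towns = rng.normal(size=(97, 9))

# Standardize first (my assumption, not stated in the memo: dollar amounts and
# percentages sit on very different scales and would otherwise dominate distances)
scaled = StandardScaler().fit_transform(towns)

# Principal components capturing ~90% of the variance
pcs = PCA(n_components=0.9).fit_transform(scaled)

# Cluster both representations, as the memo describes
kmeans_orig = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
kmeans_pca = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs)
dbscan_orig = DBSCAN(eps=1.5, min_samples=9).fit_predict(scaled)
```

Comparing the cluster labels from both passes (and the noise points DBSCAN flags) is what supports the one-cluster-plus-outliers conclusion above.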
Demographic Data Used
| Demographic | Tables | Fields | Calculation | Notes |
|---|---|---|---|---|
| Race/Ethnicity | SF1 2010: P5 | P005001, P005003 | Total - White Alone | Use Census per direction from Steven & Betsy |
| Limited English Proficiency | ACS14: B16001 | B16001_005 + B16001_008 + B16001_011 + B16001_014 + B16001_017 + B16001_020 + B16001_023 + B16001_026 + B16001_029 + B16001_032 + B16001_035 + B16001_038 + B16001_041 + B16001_044 + B16001_047 + B16001_050 + B16001_053 + B16001_056 + B16001_059 + B16001_062 + B16001_065 + B16001_068 + B16001_071 + B16001_074 + B16001_077 + B16001_080 + B16001_083 + B16001_086 + B16001_089 + B16001_092 + B16001_095 + B16001_098 + B16001_101 + B16001_104 + B16001_107 + B16001_110 + B16001_113 + B16001_116 + B16001_119 | C16001 (speaks English less than "very well") / (Total Population - (B01001_003 + B01001_027)) | |
| Median Income | ACS14: B19013 | B19013_001 | | |
| % of people with income below 200% of poverty line | ACS14: C17002 | C17002_002E + C17002_003E + C17002_004E + C17002_005E + C17002_006E + C17002_007E | | |
| Low Income Households | ACS14: B19001, B19025, B11001 | All of B19001, B19025_001, B11001_001 | HH Income Ranges, Aggregate HH Income, Total HH | |
| No Car Households | ACS14: B08201 | B08201_002 | HH with no vehicles available | |
| Population Density | https://jtleider.github.io/censusdata/api.html | B01001_001 / AREA | Total Pop / AREA | Area comes from a shapefile; also uses total population data |
| Children | ACS14: B01001, 2010Cen: P12 | (B01001_003 + B01001_004 + B01001_005 + B01001_006 + B01001_027 + B01001_028 + B01001_029 + B01001_030), (P012_003 + P012_004 + P012_005 + P012_006 + P012_027 + P012_028 + P012_029 + P012_030) | Boys under 18 plus girls under 18 | 0-17 |
| Population Over 5 | ACS14: B01001, 2010Cen: P12 | B01001_001 - (B01001_003 + B01001_027), P012_001 - (P012_003 + P012_027) | Total Population - Children under 5 | 5+ |
| Seniors | ACS14: B01001 | (B01001_020 + B01001_021 + B01001_022 + B01001_023 + B01001_024 + B01001_025) + (B01001_044 + B01001_045 + B01001_046 + B01001_047 + B01001_048 + B01001_049) | Men ages 65+ plus women ages 65+ | 65+ |
| People with Disabilities | ACS14: S1810 | S1810_C02_001E, S1810_C01_001E | S1810_C02_001E / S1810_C01_001E (total population with a disability / total non-institutionalized population) | Includes ambulatory, hearing, vision, self-care, cognitive, and independent living difficulties |
| Total Population | ACS14: B01001, 2010Cen: P1 | B01001_001, P01_001 | | Includes those housed in group quarters |
These are the sources I used to frame my methodology. They informed my assessment of the strengths and weaknesses of clustering approaches, how clustering algorithms should be calibrated, and how to assess the results of the clustering.
import os
os.environ["OMP_NUM_THREADS"] = "1" #no parallel processing b/c memory leak on windows
import numpy as np
import pandas as pd
import geopandas as gpd
import censusdata
import matplotlib
from matplotlib import pyplot
import matplotlib.cm as cm
import sklearn
from functools import reduce
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn import metrics
from scipy.spatial.distance import cdist
from sklearn.decomposition import PCA
import plotly.express as px
#set some parameters to make display of data nicer
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 2)
#Create list of Town Names and Town IDs for use in filtering the demographic data.
mpoTowns = ["Beverly","Boston","Braintree","Cambridge","Chelsea","Everett","Framingham","Franklin","Gloucester",
"Lynn","Malden","Marlborough","Medford","Melrose","Newton","Peabody","Quincy","Revere","Salem","Somerville",
"Waltham","Watertown","Weymouth","Woburn","Acton","Arlington","Ashland","Bedford","Bellingham","Belmont",
"Bolton","Boxborough","Brookline","Burlington","Canton","Carlisle","Cohasset","Concord","Danvers","Dedham",
"Dover","Essex","Foxborough","Hamilton","Hingham","Holbrook","Holliston","Hopkinton","Hudson","Hull","Ipswich",
"Lexington","Lincoln","Littleton","Lynnfield","Manchester-by-the-Sea","Marblehead","Marshfield","Maynard",
"Medfield", "Medway","Middleton","Milford","Millis","Milton","Nahant","Natick","Needham","Norfolk",
"North Reading","Norwell","Norwood","Randolph","Reading","Rockland","Rockport","Saugus","Scituate",
"Sharon","Sherborn","Southborough","Stoneham","Stow","Sudbury","Swampscott","Topsfield","Wakefield","Walpole",
"Wayland","Wellesley","Wenham","Weston","Westwood","Wilmington","Winchester","Winthrop","Wrentham"]
#Town IDs in the Census
mpoNums = ['05595','16250','21850','26150','27900','32310','37490','37560','37995','38400','41095','43580','52490','55955',
'57880','59105','60015','68645','70150','74595','00380','01605','02130','04615','05070','07350','09840','11000',
'11525','15060','21990','24960','24925','30700','31085','31540','35215','35425','35950','37875','38715','39625',
'39835',
'40115','43895','45560','48955','56130','61380','62535','67665','68050','68260','72215','72600','73440','73790',
'77255','80230','80510','81035','04930','07740','09175','11315','14640','16495','17405','24820','25172','30455',
'39765','39975','41515','41690','44105','46050','50250','55745','56000','60785','72495','74175','78690','78972',
'82315','30210','31645','38855','50145','57775','60330','07000','13205','56585','81005','06365','41165','63165']
These links are the documentation for the tables and fields I am using in the next section:
https://www.census.gov/prod/cen2010/doc/sf1.pdf
https://www.census.gov/programs-surveys/acs/technical-documentation/table-shells.2014.html
These are notes on which fields within which tables I am using.
#create lists of fields needed for each calculation
#ACS 2014 columns (race uses 2010 SF1)
race = ['P005001','P005003'] #universe = total population
lep = ['B16001_005E','B16001_008E','B16001_011E','B16001_014E','B16001_017E','B16001_020E','B16001_023E','B16001_026E',
'B16001_029E','B16001_032E','B16001_035E','B16001_038E','B16001_041E','B16001_044E','B16001_047E','B16001_050E',
'B16001_053E','B16001_056E','B16001_059E','B16001_062E','B16001_065E','B16001_068E','B16001_071E','B16001_074E',
'B16001_077E','B16001_080E','B16001_083E','B16001_086E','B16001_089E','B16001_092E','B16001_095E','B16001_098E',
'B16001_101E','B16001_104E','B16001_107E','B16001_110E','B16001_113E','B16001_116E','B16001_119E'] #universe = population > age 5
medinc = ['B19013_001E'] #No Universe needed, but HH
povPeop = ['C17002_002E','C17002_003E','C17002_004E','C17002_005E','C17002_006E','C17002_007E']
nocar = ['B08201_002E'] #Universe = Households
u18 = ['B01001_003E','B01001_004E','B01001_005E','B01001_006E','B01001_027E','B01001_028E',
'B01001_029E','B01001_030E'] #universe = total population
sen = ['B01001_020E','B01001_021E','B01001_022E','B01001_023E','B01001_024E','B01001_025E',
'B01001_044E','B01001_045E','B01001_046E','B01001_047E','B01001_048E','B01001_049E']#universe = total population
dis = ['S1810_C02_001E','S1810_C01_001E'] #universe = total NON institutionalized population
acsHH = ['B11001_001E']
acsPlus5 = ['B01001_001E','B01001_003E', 'B01001_027E']
acsTotPop = ['B01001_001E']
#################################GRAB ACS DATA###################################################
#MINORITY
#get table and reset index so that county sub is a field called muni
minority = censusdata.download('sf1',2010,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('county subdivision', '*')]),
race).reset_index().rename(columns={'index':'muni'})
#create the actual minority field
minority['Minority#'] = minority['P005001'] - minority['P005003']
#filter down to just what is in the MPO
minority=minority[minority.apply(lambda r: any([town in str(r[0]) for town in mpoNums]), axis=1)]
#LEP
limEngProf = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('county subdivision', '*')]),
lep).reset_index().rename(columns={'index':'muni'})
#create the actual LEP field
limEngProf['LEP#'] = limEngProf[list(limEngProf.columns)[:-1]].sum(axis=1)
#filter down to just what is in the MPO
limEngProf=limEngProf[limEngProf.apply(lambda r: any([town in str(r[0]) for town in mpoNums]), axis=1)]
#MEDIAN INCOME
medIncome = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('county subdivision', '*')]),
medinc).reset_index().rename(columns={'index':'muni', 'B19013_001E':'MedianIncome'})
#filter down to just what is in the MPO
medIncome=medIncome[medIncome.apply(lambda r: any([town in str(r[0]) for town in mpoNums]), axis=1)]
#PEOPLE LIVING IN HOUSEHOLDS BELOW 200% OF THE POVERTY LINE
povHH = censusdata.download('acs5',2014,censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ),
('county subdivision', '*')]),
povPeop).reset_index().rename(columns={'index':'muni'})
#create the actual HH Pov field
povHH['POV#'] = povHH[list(povHH.columns)[:-1]].sum(axis=1)
#filter down to just what is in the MPO
povHH=povHH[povHH.apply(lambda r: any([town in str(r[0]) for town in mpoNums]), axis=1)]
#NO CAR HOUSEHOLDS
noCarHH = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('county subdivision', '*')]),
nocar).reset_index().rename(columns={'index':'muni', 'B08201_002E':'NoCarHH#'})
#filter down to just what is in the MPO
noCarHH =noCarHH[noCarHH.apply(lambda r: any([town in str(r[0]) for town in mpoNums]), axis=1)]
#CHILDREN
kids = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('county subdivision', '*')]),
u18).reset_index().rename(columns={'index':'muni'})
#create the actual kids field
kids['Under18#'] = kids[list(kids.columns)[:-1]].sum(axis=1)
#filter down to just what is in the MPO
kids=kids[kids.apply(lambda r: any([town in str(r[0]) for town in mpoNums]), axis=1)]
#SENIORS
seniors = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('county subdivision', '*')]),
sen).reset_index().rename(columns={'index':'muni'})
#create the actual seniors field
seniors['Over65#'] = seniors[list(seniors.columns)[:-1]].sum(axis=1)
#filter down to just what is in the MPO
seniors=seniors[seniors.apply(lambda r: any([town in str(r[0]) for town in mpoNums]), axis=1)]
#DISABILITY
disability = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('county subdivision', '*')]),
dis,'e4bec76221ba04c7df76c7c580659bf1f54ed2c1',
'subject').reset_index().rename(columns={'index':'muni','S1810_C02_001E':'Disability#'})
#filter down to just what is in the MPO
disability=disability[disability.apply(lambda r: any([town in str(r[0]) for town in mpoNums]), axis=1)]
#universe is in the same table, so do the % calc here
disability['Disability%'] = disability['Disability#']/disability['S1810_C01_001E']
#HOUSEHOLDS
households = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('county subdivision', '*')]),
acsHH).reset_index().rename(columns={'index':'muni', 'B11001_001E':'Households#'})
#filter down to just what is in the MPO
households =households[households.apply(lambda r: any([town in str(r[0]) for town in mpoNums]), axis=1)]
#POPULATION OVER FIVE
over5 = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('county subdivision', '*')]),
acsPlus5).reset_index().rename(columns={'index':'muni'})
#create the actual Over 5 field
over5['Over5#'] = (over5['B01001_001E'] - (over5['B01001_003E'] + over5['B01001_027E']))
#filter down to just what is in the MPO
over5=over5[over5.apply(lambda r: any([town in str(r[0]) for town in mpoNums]), axis=1)]
#TOTAL POPULATION
acsPop = censusdata.download('acs5',2014,
censusdata.censusgeo([('state', '25'), ('county','017,021,025,009,023,027' ), ('county subdivision', '*')]),
acsTotPop).reset_index().rename(columns={'index':'muni', 'B01001_001E':'TotalPopulation'})
#filter down to just what is in the MPO
acsPop =acsPop[acsPop.apply(lambda r: any([town in str(r[0]) for town in mpoNums]), axis=1)]
#Merge ACS data into one dataframe and make %
#collect all the data (and filter)
acsdata = [minority[['muni', 'Minority#', 'P005001']], limEngProf[['muni', 'LEP#']], medIncome[['muni', 'MedianIncome']],
povHH[['muni','POV#']], noCarHH[['muni', 'NoCarHH#']], kids[['muni', 'Under18#']], seniors[['muni', 'Over65#']],
disability[['muni', 'Disability#', 'Disability%']], households[['muni','Households#']], over5[['muni','Over5#']],
acsPop[['muni','TotalPopulation']]]
#actually do the merging
acsDF = reduce(lambda left,right: pd.merge(left,right,on=['muni'], how='outer'), acsdata)
#make the % fields
acsDF['Minority%'] = acsDF['Minority#']/acsDF['P005001']
acsDF['LEP%'] = acsDF['LEP#']/acsDF['Over5#']
acsDF['noCarHH%'] = acsDF['NoCarHH#']/acsDF['Households#']
acsDF['Under18%'] = acsDF['Under18#']/acsDF['TotalPopulation']
acsDF['Over65%'] = acsDF['Over65#']/acsDF['TotalPopulation']
acsDF['pov%'] = acsDF['POV#']/acsDF['TotalPopulation']
#bring in shapefile for area for population density
cousub= gpd.read_file(r'C:\Users\AtkinsonM\OneDrive - Commonwealth of Massachusetts\Documents\Jupyter_Home\tl_2010_25_cousub10')
#get area in sq miles (default to square meters because of CRS) and only use land area
cousub['Area'] = cousub['ALAND10']/2589988
#make a join field for bringing in the area from shapefile to the demographic data table
acsDF['County_Sub'] = acsDF.muni.astype(str).str.slice(-5, None)
#join area to attribute table
allvarDF = pd.merge(acsDF,cousub[['COUSUBFP10','Area']],left_on = 'County_Sub', right_on='COUSUBFP10', how='left')
#create population density field (in people per square mile)
allvarDF['PopDen']=allvarDF['TotalPopulation']/allvarDF['Area']
Make graphs of all the variables to see what the ranges and relationships look like
#create county field for color symbology based on county in census geo field (muni)
allvarDF['county'] = allvarDF.muni.astype(str).str.split(pat=',', n=2, expand=False).map(lambda x:x[1])
#make scatter plots for all the variables with each other
varscat = px.scatter_matrix(
allvarDF,
dimensions=['Minority%','LEP%','MedianIncome','noCarHH%','PopDen','Under18%','Over65%','Disability%',
'pov%'],
color="county",
width=2000, height=1400
)
varscat.update_traces(diagonal_visible=False)
varscat.update_yaxes(automargin=True)
varscat.show()
So you don't have to scroll if you don't want to:
Description from https://scikit-learn.org/stable/modules/clustering.html#k-means:
"The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares (see below). This algorithm requires the number of clusters to be specified. It scales well to large number of samples and has been used across a large range of application areas in many different fields.
The k-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean μ_j of the samples in the cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from X, although they live in the same space.
The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion: ∑_{i=0}^{n} min_{μ_j ∈ C} (‖x_i − μ_j‖²)
Inertia can be recognized as a measure of how internally coherent clusters are. It suffers from various drawbacks:
Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters, or manifolds with irregular shapes.
Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”). Running a dimensionality reduction algorithm such as Principal component analysis (PCA) prior to k-means clustering can alleviate this problem and speed up the computations."
Another drawback is that the algorithm is not deterministic: it uses random seed points on each run, which may not serve the algorithm well (for example, all the seed points could land close together). Different seed points each time mean you can get a different result every run, even with the same parameters and input order. Additionally, the algorithm assigns every point to a cluster, so noise gets classified into clusters.
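On the seeding drawback: pinning scikit-learn's random_state (as the analysis later in this notebook does with random_state=10) makes a given run reproducible. A toy sketch, not part of the original analysis; the data and parameter values are my own:

```python
import numpy as np
from sklearn.cluster import KMeans

# Arbitrary 2-D points standing in for real data
rng = np.random.default_rng(1)
pts = rng.normal(size=(60, 2))

# With a fixed random_state the seed points are the same each run,
# so repeated fits produce identical cluster labelings
run1 = KMeans(n_clusters=3, n_init=10, random_state=10).fit_predict(pts)
run2 = KMeans(n_clusters=3, n_init=10, random_state=10).fit_predict(pts)
assert (run1 == run2).all()
```

Without random_state, each call reseeds from fresh randomness and the labelings (and even the cluster numbering) can differ between runs.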
"Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually.
Silhouette coefficients (as these values are referred to as) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters and negative values indicate that those samples might have been assigned to the wrong cluster." from https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#sphx-glr-auto-examples-cluster-plot-kmeans-silhouette-analysis-py
Given the importance of the number of clusters to the K-means algorithm, I tested multiple ways of determining the ideal number of clusters for the input data. Silhouette analysis is just one way. In this case, I am iteratively running a silhouette analysis (and visualizing it) over a range of cluster counts.
Silhouette analysis is also used further below to test the fit of the DBSCAN model clusters to the data; it can be used with multiple algorithms.
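Before applying it to the town data, here is a toy illustration of how the average silhouette score rewards the correct number of clusters; the blob data and the choice of k values are my own invention:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two tight, well-separated blobs: the "true" structure is k=2
rng = np.random.default_rng(0)
pts = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
])

# Score the correct k against a deliberately wrong k
score_k2 = silhouette_score(pts, KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pts))
score_k4 = silhouette_score(pts, KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(pts))
print(score_k2, score_k4)
```

Forcing k=4 splits each blob, pushing points toward the decision boundaries and dragging the average silhouette coefficient down relative to k=2.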
X=allvarDF[['Minority%','LEP%','MedianIncome','noCarHH%','PopDen','Under18%','Over65%','Disability%',
'pov%','County_Sub']].set_index('County_Sub')
range_n_clusters = [2, 3, 4, 5, 6]
for n_clusters in range_n_clusters:
# Create a subplot with 1 row and 2 columns
fig, (ax1, ax2) = pyplot.subplots(1, 2)
fig.set_size_inches(18, 7)
# The 1st subplot is the silhouette plot
# The silhouette coefficient can range from -1, 1
ax1.set_xlim([-0.1, 1])
# The (n_clusters+1)*10 is for inserting blank space between silhouette
# plots of individual clusters, to demarcate them clearly.
ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
# Initialize the clusterer with n_clusters value and a random generator
# seed of 10 for reproducibility.
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(X)
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = metrics.silhouette_score(X, cluster_labels)
print("For n_clusters =", n_clusters,
"The average silhouette_score is :", silhouette_avg)
# Compute the silhouette scores for each sample
sample_silhouette_values = metrics.silhouette_samples(X, cluster_labels)
y_lower = 10
for i in range(n_clusters):
# Aggregate the silhouette scores for samples belonging to
# cluster i, and sort them
ith_cluster_silhouette_values = \
sample_silhouette_values[cluster_labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
color = cm.nipy_spectral(float(i) / n_clusters)
ax1.fill_betweenx(np.arange(y_lower, y_upper),
0, ith_cluster_silhouette_values,
facecolor=color, edgecolor=color, alpha=0.7)
# Label the silhouette plots with their cluster numbers at the middle
ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
# Compute the new y_lower for next plot
y_lower = y_upper + 10 # 10 for the 0 samples
ax1.set_title("The silhouette plot for the various clusters.")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")
# The vertical line for average silhouette score of all the values
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
ax1.set_yticks([]) # Clear the yaxis labels / ticks
ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
# 2nd Plot showing the actual clusters formed
colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
ax2.scatter(X.iloc[:, 0], X.iloc[:, 2], marker='.', s=30, lw=0, alpha=0.7,
c=colors, edgecolor='k') #this only maps the locations of one variable per index (using MedianIncome here)
#so shape is not going to be accurate on any of the scatter plots below
#this is why I use the parallel lines chart
# Labeling the clusters
centers = clusterer.cluster_centers_
# Draw white circles at cluster centers
ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
c="white", alpha=1, s=200, edgecolor='k')
for i, c in enumerate(centers):
ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
s=50, edgecolor='k')
ax2.set_title("The visualization of the clustered data.")
ax2.set_xlabel("Feature space for the 1st feature")
ax2.set_ylabel("Feature space for the 2nd feature")
pyplot.suptitle(("Silhouette analysis for KMeans clustering on sample data "
"with n_clusters = %d" % n_clusters),
fontsize=14, fontweight='bold')
pyplot.show()
For n_clusters = 2 The average silhouette_score is : 0.5569444577817235
For n_clusters = 3 The average silhouette_score is : 0.5513182541344039
For n_clusters = 4 The average silhouette_score is : 0.5106917220793826
For n_clusters = 5 The average silhouette_score is : 0.48784756550212927
For n_clusters = 6 The average silhouette_score is : 0.48323372450816343
Something to note here: we may still get a good cluster out of this even if most of the points are not clustered well (a silhouette score near 1 beats one near 0). But for cluster 1 the drop-off is larger, and there are far more points adding to the variation. Something to think about if this all goes sideways.
An alternate way to find the ideal number of clusters for K-means is the very popular elbow diagram. In this method, you iterate through a list of values for k (the number of clusters) as inputs to the K-means algorithm. After each K-means run, the distortion and inertia are calculated:
from https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/
The ideal number of clusters should be at the 'elbow' of the line in the diagram. It is the k value where the distortion/inertia stops dropping as rapidly as k increases. See below for examples.
#make elbow diagram
X=allvarDF[['Minority%','LEP%','MedianIncome','noCarHH%','PopDen','Under18%','Over65%','Disability%',
'pov%','County_Sub']].set_index('County_Sub')
distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(1, 10)
for k in K:
# Building and fitting the model
    kmeanModel = KMeans(n_clusters=k).fit(X)
distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_,
'euclidean'), axis=1)) / X.shape[0])
inertias.append(kmeanModel.inertia_)
mapping1[k] = sum(np.min(cdist(X, kmeanModel.cluster_centers_,
'euclidean'), axis=1)) / X.shape[0]
mapping2[k] = kmeanModel.inertia_
#distortion elbow diagram
matplotlib.pyplot.plot(K, distortions, 'bx-')
matplotlib.pyplot.xlabel('Values of K')
matplotlib.pyplot.ylabel('Distortion')
matplotlib.pyplot.title('The Elbow Method using Distortion')
matplotlib.pyplot.show()
#This shows an elbow at 3 clusters
#inertia elbow diagram
matplotlib.pyplot.plot(K, inertias, 'bx-')
matplotlib.pyplot.xlabel('Values of K')
matplotlib.pyplot.ylabel('Inertia')
matplotlib.pyplot.title('The Elbow Method using Inertia')
matplotlib.pyplot.show()
#This shows an elbow at 3 clusters
(https://static1.squarespace.com/static/5ff2adbe3fe4fe33db902812/t/6062a083acbfe82c7195b27d/1617076404560/ISLR%2BSeventh%2BPrinting.pdf) Algorithm 10.1 K-Means Clustering
1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.
   a. K-means++ initializes the randomly assigned clusters with cluster centroids that are less random: while they may not be exactly the same every run, the initial cluster centroids are dispersed enough to be useful. Using K-means++ makes the algorithm more deterministic, though not totally.
2. Iterate until the cluster assignments stop changing:
   a. For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.
   b. Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
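The steps above can be sketched in plain NumPy. This is an illustrative implementation of Algorithm 10.1 only, not the scikit-learn code used in this study; the guard for emptied clusters is my own addition.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Bare-bones sketch of Algorithm 10.1 (K-means), for illustration."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each observation to one of the K clusters
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Step 2a: each centroid is the mean of its cluster's observations
        # (an emptied cluster is re-seeded from a random observation)
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j)
            else X[rng.integers(len(X))]
            for j in range(k)
        ])
        # Step 2b: reassign each observation to its nearest centroid (Euclidean)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments stopped changing
            break
        labels = new_labels
    return labels, centers

# Two obvious blobs as a quick check
demo_rng = np.random.default_rng(1)
demo = np.vstack([
    demo_rng.normal(0.0, 0.5, size=(20, 2)),
    demo_rng.normal(10.0, 0.5, size=(20, 2)),
])
labels, centers = lloyd_kmeans(demo, k=2)
```

Scikit-learn's KMeans adds the k-means++ seeding and multiple restarts (n_init) on top of this basic loop.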
#testing K-means
kmeans = KMeans(n_clusters=3,init='k-means++', random_state=0).fit(X)
#put clusters into dataset
X['KmeansClusters'] = pd.Series(kmeans.labels_, index=X.index)
X
| Minority% | LEP% | MedianIncome | noCarHH% | PopDen | Under18% | Over65% | Disability% | pov% | KmeansClusters | |
|---|---|---|---|---|---|---|---|---|---|---|
| County_Sub | ||||||||||
| 05595 | 0.09 | 0.02 | 73980 | 0.10 | 2674.71 | 0.17 | 0.13 | 0.10 | 0.17 | 1 |
| 16250 | 0.06 | 0.03 | 77404 | 0.06 | 2039.45 | 0.20 | 0.16 | 0.10 | 0.11 | 1 |
| 21850 | 0.04 | 0.02 | 89185 | 0.03 | 256.23 | 0.16 | 0.15 | 0.08 | 0.12 | 0 |
| 26150 | 0.06 | 0.04 | 60229 | 0.10 | 1116.08 | 0.16 | 0.18 | 0.12 | 0.23 | 1 |
| 27900 | 0.09 | 0.03 | 108558 | 0.02 | 570.91 | 0.25 | 0.13 | 0.07 | 0.08 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56585 | 0.38 | 0.23 | 50900 | 0.19 | 9359.72 | 0.18 | 0.12 | 0.14 | 0.31 | 1 |
| 81005 | 0.11 | 0.06 | 64169 | 0.12 | 9111.04 | 0.18 | 0.16 | 0.12 | 0.20 | 1 |
| 63165 | 0.14 | 0.02 | 149375 | 0.02 | 703.81 | 0.25 | 0.11 | 0.06 | 0.06 | 2 |
| 41165 | 0.17 | 0.11 | 68007 | 0.08 | 1911.96 | 0.21 | 0.11 | 0.10 | 0.21 | 1 |
| 06365 | 0.07 | 0.01 | 144461 | 0.00 | 251.55 | 0.26 | 0.10 | 0.05 | 0.08 | 2 |
97 rows × 10 columns
#Show Kmeans Clusters on a map
Kmap = cousub.merge(X.reset_index(), how = 'right', left_on = 'COUSUBFP10', right_on = 'County_Sub')
#Kmap.plot(column='KmeansClusters')
fig = px.choropleth(Kmap,
geojson=Kmap.geometry,
locations=Kmap.index,
color="KmeansClusters",
projection="mercator")
fig.update_geos(fitbounds="locations", visible=False)
fig.show()
#Visualize Clusters with Parallel Coordinates Plot
fig = px.parallel_coordinates(X, color="KmeansClusters",
dimensions=['Minority%','LEP%','MedianIncome',
'noCarHH%','PopDen','Under18%','Over65%','Disability%','pov%'] ,
labels={'Minority%':'Minority %', 'LEP%':'LEP %',
'MedianIncome':'Median Income',
'noCarHH%':'No Car HH %','PopDen':'Population Density',
'Under18%':'Children','Over65%':'Seniors',
'Disability%':'People with Disabilities %',
'pov%':'Poverty %', },
color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
Essentially, we do get three clusters, and they are interesting.
What the above parallel coordinates plot shows is a visualization of the clusters based on the data for each town. As you can see, Yellow (2) and Blue (0) are pretty similar, but Pink (1) looks like a combination of outliers and data following a different pattern. DBSCAN, by design, does not assign outliers to clusters. See below.
Given the results and limitations of K-means, I felt it was necessary to try an alternate algorithm with different strengths and weaknesses.
"The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples). There are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.
More formally, we define a core sample as being a sample in the dataset such that there exist min_samples other samples within a distance of eps, which are defined as neighbors of the core sample. This tells us that the core sample is in a dense area of the vector space. A cluster is a set of core samples that can be built by recursively taking a core sample, finding all of its neighbors that are core samples, finding all of their neighbors that are core samples, and so on. A cluster also has a set of non-core samples, which are samples that are neighbors of a core sample in the cluster but are not themselves core samples. Intuitively, these samples are on the fringes of a cluster." from https://scikit-learn.org/stable/modules/clustering.html#dbscan
DBSCAN is deterministic when the data is fed into the algorithm in the same order, so it will generally give the same result every time it is run unless the user re-sorts the data; it is naturally far more deterministic than K-means without K-means++. Its major benefits are the separation of noise and the use of density to identify clusters, which allows clusters of any shape.
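A toy illustration of the noise separation (the example points, eps, and min_samples here are my own and unrelated to the MPO data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# One dense cluster plus two far-off stragglers
rng = np.random.default_rng(0)
pts = np.vstack([
    rng.normal(loc=0.0, scale=0.2, size=(30, 2)),
    [[10.0, 10.0], [-10.0, 8.0]],
])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(pts)
# DBSCAN leaves the stragglers out of any cluster, labeling them -1 (noise)
print(sorted(set(labels)))
```

This is exactly the behavior that makes DBSCAN attractive here: the outlier-like towns seen in the parallel coordinates plot would be flagged as noise instead of being forced into a cluster.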
DBSCAN Documentation
As explained above, the main DBSCAN parameters are eps and min_samples. The work below shows how to optimize the eps parameter. The optimal min_samples was determined by rerunning the cell below with the min_samples as every number between the number of variables and twice the number of variables.
#https://medium.com/@tarammullin/dbscan-parameter-estimation-ff8330e3a3bd
#Set Parameters
Y = allvarDF[['Minority%','LEP%','MedianIncome','noCarHH%',
'PopDen','Under18%','Over65%','Disability%','pov%', 'County_Sub']].set_index('County_Sub')
#Minimum Samples
#presume that the min number of neighbors should be between 9 and 18 (the number of variables, and double that number)
#Eps
#use nearest neighbors to estimate
neighbors = NearestNeighbors(n_neighbors=9) #use min neighbors here
neighbors_fit = neighbors.fit(Y) #actually calculate
distances, indices = neighbors_fit.kneighbors(Y) #grab results
#prepare and map
distances = np.sort(distances, axis=0)
distances = distances[:,1]
eps_plot = px.line(distances, title='average distance between each point and its n_neighbors')
eps_plot.show()
#The elbow of the graph shows where diminishing returns set in - in this case 4934
Additionally, run a Silhouette Analysis so that after running the DBSCAN algorithm we can see how useful the clusters are. This time, since we do not need to iterate through several numbers of clusters, we only do this once. While I don't visualize the Silhouette Analysis for the DBSCAN results, one could; I deemed it unnecessary given the low average Silhouette Coefficient, which indicates a poor fit - the data points are often close to the edges of their clusters. See the Silhouette Analysis explanation above.
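One caveat worth sketching before running it (toy blob data, not the town data): `metrics.silhouette_score` treats DBSCAN's noise label (-1) as if it were its own cluster, so an alternative is to score only the non-noise points. This assumes at least two real clusters survive the noise filter.

```python
# Hedged sketch: silhouette scoring with DBSCAN noise points excluded
import numpy as np
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X_blobs, _ = make_blobs(n_samples=200, centers=[[0, 0], [5, 5], [0, 5]],
                        cluster_std=0.5, random_state=0)
blob_labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X_blobs)

mask = blob_labels != -1  # drop noise before scoring
score_no_noise = metrics.silhouette_score(X_blobs[mask], blob_labels[mask])
print('Silhouette (noise excluded): %0.3f' % score_no_noise)
```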
#try DBSCAN (https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py)
# Compute DBSCAN
db = DBSCAN(eps=4934, min_samples=9).fit(Y)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
#This is a Clustering Performance Evaluation - especially because no ground truth
#see: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(Y, labels))
#As a note: the higher the Silhouette Coefficient (range -1 to +1), the better defined the model's clusters.
#Here the clusters are not very well defined, though they likely do not overlap, given the score's distance from 0.
#Also note: if standardized data is used instead, we get the results below:
#Estimated number of clusters: 1
#Estimated number of noise points: 26
#Silhouette Coefficient: 0.329
Estimated number of clusters: 2
Estimated number of noise points: 34
Silhouette Coefficient: 0.264
While not terribly useful given that the average silhouette coefficient is so low, it does give us an idea of what the clusters look like. The clusters seem to confirm the concerns the silhouette analysis raised: it looks like the Yellow and Red clusters don't have much differentiating them, and a lot of noise points overlap the main cluster of data.
#make scatter plots for all the variables with each other
Y['DBSCAN_Clusters'] = labels
varscat2 = px.scatter_matrix(
Y,
dimensions=['Minority%','LEP%','MedianIncome','noCarHH%','PopDen','Under18%','Over65%','Disability%',
'pov%'],
color='DBSCAN_Clusters',
width=2000, height=1400
)
varscat2.update_traces(diagonal_visible=False)
varscat2.update_yaxes(automargin=True)
varscat2.show()
#There are 2 clusters - dark blue is NOISE (marker = -1). The two clusters are Pink and Yellow - 0 and 1 respectively.
# As you can see from looking at all the combinations of the dimensions, I can't see a strong case for these clusters being
# useful. While looking at a 2D space is different than a 9D space, there are no plots below that show clear clusters of
# yellow or pink. They seem to overlap or sit right next to each other, making it hard to understand
# in 2D why they would be considered part of different clusters. In 9D it is likely that they are clusters, but not
# very distinguishable from each other, per the result of the Silhouette Analysis.
At this point, I have two algorithms that are not finding substantial evidence that this data is clustered. Trying more algorithms that may be better optimized for the problem would only be useful if we had determined that clusters exist but were not being correctly assigned; with the current variables, that seems unlikely. I decided not to run any Hierarchical Clustering or Neural Network algorithms, in favor of doing a Principal Component Analysis to reduce the dimensions (variables) of the problem.
Since our results have not been substantial with nine dimensions, another way to see if there are clusters is to create them based on the relationships between the dimensions.
"PCA finds which features are most correlated in a dataset, and removes them, leaving you with the most “important” features - i.e., the “principal components.” If you wanted to choose a single feature from this dataset to represent the entire dataset, you would want to select the one that contains the most information. When faced with a gigantic, high-dimensional dataset, you could pick and choose the features you think are important. But this can be inefficient, and even worse, can lead to us missing out on potentially important data! PCA can provide a rigorous way of determining which features are important.
This is the fundamental idea of PCA: select the features that contain the most information. So how do we determine which features contain the most information? We measure variation (or, how different a feature is from another) as a proxy for information.
Why does measuring variation make sense? When two features are correlated, it doesn’t make sense to include them both in a model, because you only need one to capture the information of both. In PCA, we use a similar concept to detect and remove redundant and non-informative features. If a feature contributes very little variation, it can be removed." from https://drive.google.com/file/d/1YdA-HHYP1V05QgvwLCvfnuuau67Zl38n/view (Delta Analytics)
Results from this can be reused in both of the algorithms to see if better results can be achieved.
https://builtin.com/data-science/step-step-explanation-principal-component-analysis
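A small sketch of the idea quoted above (synthetic data, not the town variables): two nearly duplicate features collapse onto a single principal component that carries almost all of the variance, making the second component redundant.

```python
# Two correlated features -> one dominant principal component
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # near-copy of x1
pair = np.column_stack([x1, x2])

ratios = PCA().fit(pair).explained_variance_ratio_
print('explained variance ratios:', np.round(ratios, 3))
```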
#must standardize first
#"Mathematically, this can be done by subtracting the mean and dividing
#by the standard deviation for each value of each variable."
#"the reason why it is critical to perform standardization prior to PCA,
#is that the latter is quite sensitive regarding the variances of the initial variables."
# from https://builtin.com/data-science/step-step-explanation-principal-component-analysis
#create empty df and add all the columns but standardized.
std = allvarDF[['Minority%','LEP%','MedianIncome','noCarHH%','PopDen','Under18%','Over65%',
'Disability%','pov%', 'county','County_Sub']].set_index('County_Sub')
for x in ['Minority%','LEP%','MedianIncome','noCarHH%','PopDen','Under18%','Over65%','Disability%','pov%']:
std[x] = (std[x] - std[x].mean())/std[x].std()
std
| Minority% | LEP% | MedianIncome | noCarHH% | PopDen | Under18% | Over65% | Disability% | pov% | county | |
|---|---|---|---|---|---|---|---|---|---|---|
| County_Sub | ||||||||||
| 05595 | -0.54 | -0.63 | -0.79 | 0.41 | -0.12 | -1.05 | -0.05 | 0.49 | 0.47 | Essex County |
| 16250 | -0.72 | -0.52 | -0.68 | -0.14 | -0.28 | -0.18 | 0.80 | 0.20 | -0.26 | Essex County |
| 21850 | -0.90 | -0.53 | -0.31 | -0.60 | -0.75 | -1.19 | 0.41 | -0.35 | -0.15 | Essex County |
| 26150 | -0.75 | -0.34 | -1.23 | 0.36 | -0.52 | -1.18 | 1.47 | 1.10 | 1.20 | Essex County |
| 27900 | -0.54 | -0.49 | 0.31 | -0.77 | -0.67 | 1.08 | -0.30 | -0.61 | -0.73 | Essex County |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56585 | 1.63 | 2.51 | -1.53 | 1.78 | 1.63 | -0.80 | -0.41 | 2.07 | 2.24 | Suffolk County |
| 81005 | -0.33 | 0.02 | -1.10 | 0.65 | 1.56 | -0.66 | 0.74 | 1.06 | 0.86 | Suffolk County |
| 63165 | -0.14 | -0.55 | 1.61 | -0.78 | -0.63 | 1.08 | -1.07 | -1.10 | -0.97 | Worcester County |
| 41165 | 0.12 | 0.74 | -0.98 | 0.15 | -0.32 | 0.20 | -0.76 | 0.46 | 0.93 | Worcester County |
| 06365 | -0.70 | -0.69 | 1.46 | -1.07 | -0.75 | 1.38 | -1.23 | -1.59 | -0.72 | Worcester County |
97 rows × 10 columns
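The manual standardization loop above can also be done with scikit-learn's StandardScaler, with one caveat worth flagging: pandas `.std()` is the sample standard deviation (ddof=1) while StandardScaler divides by the population version (ddof=0), so the two outputs differ by a constant factor of sqrt((n-1)/n). Sketch on a tiny hypothetical frame:

```python
# Manual z-scoring (as in the cell above) vs. StandardScaler
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [10.0, 20.0, 40.0, 80.0]})

manual = (toy - toy.mean()) / toy.std()        # ddof=1, as in the cell above
scaled = StandardScaler().fit_transform(toy)   # ddof=0

n = len(toy)
# the two differ only by the constant factor sqrt((n-1)/n)
print(np.allclose(manual.to_numpy(), scaled * np.sqrt((n - 1) / n)))
```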
#Show the cumulative % of variance explained as principal components are added
#run PCA and plot the results
pca = PCA(whiten=True) #all variables/components kept
#fit_transform returns the data projected onto the principal components
#(the component vectors themselves live in pca.components_)
components = pca.fit_transform(std[['Minority%','LEP%','MedianIncome','noCarHH%','PopDen','Under18%',
'Over65%','Disability%','pov%']])
pyplot.plot(pca.explained_variance_ratio_.cumsum(), marker = 'o')
pyplot.title('Explained Variance by Components')
pyplot.xlabel('Number of Components')
pyplot.ylabel('Cumulative Explained Variance')
#Just a note - PC1 is index 0, PC2 is index 1, etc. - looks like we should use 3 PCs according to
# https://365datascience.com/tutorials/python-tutorials/pca-k-means/
# they say the rule of thumb is to preserve about 80% of the variance; 2 PCs is a little below 80% and 3 is above.
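As a side note on the ~80%-of-variance rule of thumb: scikit-learn's PCA also accepts a float n_components and keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on synthetic data (97 rows to echo the town count, not the real demographic variables):

```python
# Picking the number of components by a cumulative-variance threshold
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
factors = rng.normal(size=(97, 3))                           # 3 underlying factors
mixed = factors @ rng.normal(size=(3, 9)) + 0.1 * rng.normal(size=(97, 9))

cumvar = PCA().fit(mixed).explained_variance_ratio_.cumsum()
n_keep = int(np.argmax(cumvar >= 0.80)) + 1   # first component count crossing 80%

pca80 = PCA(n_components=0.80).fit(mixed)     # same rule, done by scikit-learn
print(n_keep, pca80.n_components_)
```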
#actually run PCA again and show results
pca = PCA(whiten=True) #all variables/components kept
#fit_transform returns the data projected onto the principal components (vectors are in pca.components_)
components = pca.fit_transform(std[['Minority%','LEP%','MedianIncome','noCarHH%','PopDen','Under18%',
'Over65%','Disability%','pov%']])
#for each principal component, what is the percent of variance it explains
labels = {
str(i): f"PC {i+1} ({var:.1f}%)"
for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}
#display!
#range is three because I know after this point the Principal Components have diminishing returns
pcaPlot = px.scatter_matrix(
components,
labels=labels,
dimensions=range(3),
color=std["county"]
)
pcaPlot.update_traces(diagonal_visible=False)
pcaPlot.show()
#make new df for use in reruns of algorithms
postPCA = pd.concat([allvarDF, pd.DataFrame(components)], axis=1)
postPCA.columns.values[-9:] = ['PC1', 'PC2','PC3','PC4', 'PC5','PC6','PC7', 'PC8','PC9']
#########################make elbow diagram#########################
distortions_PCA = []
inertias_PCA = []
mapping1_PCA = {}
mapping2_PCA = {}
K_PCA = range(1, 10)
pp = postPCA[['PC1', 'PC2','PC3', 'County_Sub']].set_index('County_Sub')
for k in K_PCA:
    # Build and fit the model (a single .fit() call is sufficient)
    kmeanModel_PCA = KMeans(n_clusters=k).fit(pp)
distortions_PCA.append(sum(np.min(cdist(pp, kmeanModel_PCA.cluster_centers_,
'euclidean'), axis=1)) / pp.shape[0])
inertias_PCA.append(kmeanModel_PCA.inertia_)
mapping1_PCA[k] = sum(np.min(cdist(pp, kmeanModel_PCA.cluster_centers_,
'euclidean'), axis=1)) / pp.shape[0]
mapping2_PCA[k] = kmeanModel_PCA.inertia_
#distortion elbow diagram
matplotlib.pyplot.plot(K_PCA, distortions_PCA, 'bx-')
matplotlib.pyplot.xlabel('Values of K')
matplotlib.pyplot.ylabel('Distortion')
matplotlib.pyplot.title('The Elbow Method using Distortion')
matplotlib.pyplot.show()
#This doesn't show much of an elbow - use 3? 4? Really, we shouldn't rely on this chart, because distortion drops
# more consistently than is useful for us. The real note is that distortion is not the quantity K-means optimizes
# while running on the data - inertia is.
#inertia elbow diagram
matplotlib.pyplot.plot(K_PCA, inertias_PCA, 'bx-')
matplotlib.pyplot.xlabel('Values of K')
matplotlib.pyplot.ylabel('Inertia')
matplotlib.pyplot.title('The Elbow Method using Inertia')
matplotlib.pyplot.show()
#This shows an elbow at 4 clusters
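A quick check (toy blobs, not the post-PCA data) of the two elbow quantities used above: KMeans' `inertia_` is the sum of *squared* distances to the nearest center, while the "distortion" accumulated in the loop is the mean unsquared distance - which is why the two charts can suggest different elbows.

```python
# Verifying what inertia_ and "distortion" actually measure
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_chk, _ = make_blobs(n_samples=100, centers=3, random_state=0)
km_chk = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_chk)

nearest = cdist(X_chk, km_chk.cluster_centers_, 'euclidean').min(axis=1)
distortion = nearest.mean()   # what the elbow loop above stores
print('inertia matches sum of squared distances:',
      np.isclose(km_chk.inertia_, (nearest ** 2).sum()))
```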
Just to keep things honest, here is the Silhouette Analysis for K-Means - you can see from the diagrams that none of the solutions are very good. Most have at least some negative points, meaning those samples were likely put into the wrong cluster. The silhouette results are worse for K-means after PCA than with the original nine dimensions.
Something to note: while the 2-cluster solution has the highest score, neither elbow diagram would agree with that choice. The 4-cluster solution has the next highest score, though not a very high one, which matches the inertia elbow diagram.
Lastly, the scatter plot points are plotted by PC1 and PC2; PC3 will be used for plotting later in the analysis.
X=postPCA[['PC1', 'PC2','PC3', 'County_Sub']].set_index('County_Sub')
range_n_clusters = [2, 3, 4, 5, 6,7,8]
for n_clusters in range_n_clusters:
# Create a subplot with 1 row and 2 columns
fig, (ax1, ax2) = pyplot.subplots(1, 2)
fig.set_size_inches(18, 7)
# The 1st subplot is the silhouette plot
# The silhouette coefficient can range from -1, 1
ax1.set_xlim([-0.1, 1])
# The (n_clusters+1)*10 is for inserting blank space between silhouette
# plots of individual clusters, to demarcate them clearly.
ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])
# Initialize the clusterer with n_clusters value and a random generator
# seed of 10 for reproducibility.
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(X)
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = metrics.silhouette_score(X, cluster_labels)
print("For n_clusters =", n_clusters,
"The average silhouette_score is :", silhouette_avg)
# Compute the silhouette scores for each sample
sample_silhouette_values = metrics.silhouette_samples(X, cluster_labels)
y_lower = 10
for i in range(n_clusters):
# Aggregate the silhouette scores for samples belonging to
# cluster i, and sort them
ith_cluster_silhouette_values = \
sample_silhouette_values[cluster_labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
color = cm.nipy_spectral(float(i) / n_clusters)
ax1.fill_betweenx(np.arange(y_lower, y_upper),
0, ith_cluster_silhouette_values,
facecolor=color, edgecolor=color, alpha=0.7)
# Label the silhouette plots with their cluster numbers at the middle
ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
# Compute the new y_lower for next plot
y_lower = y_upper + 10 # 10 for the 0 samples
ax1.set_title("The silhouette plot for the various clusters.")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")
# The vertical line for average silhouette score of all the values
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
ax1.set_yticks([]) # Clear the yaxis labels / ticks
ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
# 2nd Plot showing the actual clusters formed
colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X.iloc[:, 0], X.iloc[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')
    # Points are plotted by PC1 (x) and PC2 (y), matching the cluster centers
    # drawn below; the third dimension (PC3) is explored later in the analysis
# Labeling the clusters
centers = clusterer.cluster_centers_
# Draw white circles at cluster centers
ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
c="white", alpha=1, s=200, edgecolor='k')
for i, c in enumerate(centers):
ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
s=50, edgecolor='k')
ax2.set_title("The visualization of the clustered data.")
ax2.set_xlabel("Feature space for the 1st feature")
ax2.set_ylabel("Feature space for the 2nd feature")
pyplot.suptitle(("Silhouette analysis for KMeans clustering on sample data "
"with n_clusters = %d" % n_clusters),
fontsize=14, fontweight='bold')
pyplot.show()
For n_clusters = 2 The average silhouette_score is : 0.5176835531676238
For n_clusters = 3 The average silhouette_score is : 0.3427293290076725
For n_clusters = 4 The average silhouette_score is : 0.3834325517471558
For n_clusters = 5 The average silhouette_score is : 0.35767904774533077
For n_clusters = 6 The average silhouette_score is : 0.3047048260090521
For n_clusters = 7 The average silhouette_score is : 0.31879635314551663
For n_clusters = 8 The average silhouette_score is : 0.3200541270599367
#testing K-means
kmeans_PCA = KMeans(n_clusters=4,init='k-means++', random_state=0).fit(pp)
#put clusters into dataset
postPCA['KmeansClusters'] = pd.Series(kmeans_PCA.labels_, index=postPCA.index)
postPCA
| muni | Minority# | P005001 | LEP# | MedianIncome | POV# | NoCarHH# | Under18# | Over65# | Disability# | ... | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | KmeansClusters | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Beverly city, Essex County, Massachusetts: Sum... | 3397 | 39502 | 674 | 73980 | 6912 | 1583 | 6669 | 5413 | 4111 | ... | 0.24 | -0.91 | 0.70 | -1.05 | -0.48 | -0.03 | 1.18 | -1.03 | 0.28 | 1 |
| 1 | Danvers town, Essex County, Massachusetts: Sum... | 1654 | 26493 | 653 | 77404 | 3077 | 643 | 5400 | 4278 | 2532 | ... | -0.22 | -0.91 | 0.15 | 0.01 | -0.85 | 0.69 | 0.15 | 0.81 | -0.70 | 1 |
| 2 | Essex town, Essex County, Massachusetts: Summa... | 135 | 3504 | 84 | 89185 | 438 | 44 | 573 | 527 | 289 | ... | -0.37 | -0.92 | 0.78 | -0.76 | 0.90 | 1.67 | 0.39 | -1.33 | 0.07 | 1 |
| 3 | Gloucester city, Essex County, Massachusetts: ... | 1689 | 28789 | 1037 | 60229 | 6700 | 1175 | 4683 | 5172 | 3458 | ... | 0.37 | -1.95 | -0.11 | 0.21 | -0.74 | 1.00 | 1.50 | -2.43 | 0.82 | 1 |
| 4 | Hamilton town, Essex County, Massachusetts: Su... | 676 | 7764 | 203 | 108558 | 614 | 57 | 2018 | 1031 | 599 | ... | -0.75 | 0.43 | -0.50 | -0.42 | -0.71 | 0.43 | -0.05 | 0.83 | -0.26 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 92 | Revere city, Suffolk County, Massachusetts: Su... | 19456 | 51755 | 11545 | 50900 | 16635 | 3848 | 9328 | 6608 | 7685 | ... | 2.21 | -0.14 | -1.19 | 0.41 | -0.16 | -0.47 | -1.10 | -1.29 | -1.47 | 2 |
| 93 | Winthrop Town city, Suffolk County, Massachuse... | 2011 | 17497 | 1049 | 64169 | 3638 | 863 | 3248 | 2815 | 2123 | ... | 0.79 | -1.01 | 0.70 | 0.42 | -1.82 | -0.47 | -1.86 | 0.25 | 1.63 | 1 |
| 94 | Southborough town, Worcester County, Massachus... | 1362 | 9767 | 216 | 149375 | 563 | 63 | 2459 | 1043 | 601 | ... | -0.94 | 1.33 | -0.06 | -0.52 | 0.61 | -0.94 | 0.08 | -0.39 | 0.47 | 0 |
| 95 | Milford town, Worcester County, Massachusetts:... | 4895 | 27999 | 2922 | 68007 | 5856 | 880 | 6047 | 3225 | 2869 | ... | 0.50 | 0.06 | -1.19 | -1.00 | -0.72 | 1.20 | 0.50 | -1.07 | -0.77 | 1 |
| 96 | Bolton town, Worcester County, Massachusetts: ... | 320 | 4897 | 63 | 144461 | 384 | 0 | 1310 | 508 | 240 | ... | -1.14 | 1.44 | -0.01 | -0.87 | -0.30 | 0.26 | -0.04 | -1.36 | 1.32 | 0 |
97 rows × 35 columns
#3D map showing Kmeans clusters but also shows the used PCs in space
fig_3D = px.scatter_3d(postPCA, x='PC1', y='PC2', z='PC3', color ='KmeansClusters' )
fig_3D.show()
As you can see above, it looks like the Yellow and Orange clusters are outliers, and the Blue and Purple don't necessarily seem like they should be two separate clusters.
Below, you can see what the clusters look like on a map. Except for Yellow, the groupings don't match anything we would predict - which is fine, given that we are looking for patterns that are meaningful but also differ from what we would predict. The issue is the level of meaning these clusters carry, which is not high.
#Show Kmeans Clusters on a map
Kmap_PCA = cousub.merge(postPCA.reset_index(), how = 'right', left_on = 'COUSUBFP10', right_on = 'County_Sub')
#Kmap.plot(column='KmeansClusters')
fig_PCA = px.choropleth(Kmap_PCA,
geojson=Kmap_PCA.geometry,
locations=Kmap_PCA.index,
color="KmeansClusters",
projection="mercator")
fig_PCA.update_geos(fitbounds="locations", visible=False)
fig_PCA.show()
#Visualize Clusters with Parallel Coordinates Plot
fig_PCA2 = px.parallel_coordinates(postPCA, color="KmeansClusters",
dimensions=['PC1', 'PC2', 'PC3'] ,
                                   labels={'PC1':'Principal Component 1','PC2':'Principal Component 2',
                                           'PC3':'Principal Component 3', },
color_continuous_scale=px.colors.sequential.Plasma)
fig_PCA2.show()
Confirming what we saw in the 3D scatter plot, there is overlap between Purple and Blue, and while Yellow and Orange are distinct, they are somewhat spread out and include only a minority of the features.
Along with the low silhouette score, the K-means clusters on the post-PCA data do not look particularly helpful: most of the features fall into two clusters that do not obviously belong apart. Given the Yellow and Orange outliers, let's run DBSCAN again to see what clusters it predicts.
#https://medium.com/@tarammullin/dbscan-parameter-estimation-ff8330e3a3bd
#Set Parameters
ypca = postPCA[['PC1', 'PC2','PC3', 'County_Sub']].set_index('County_Sub')
#Minimum Samples
#presume that the min number of neighbors should be between 3 and 6 (the number of PCs, and double that number)
#Eps
#use nearest neighbors to estimate
neighbors_pca = NearestNeighbors(n_neighbors=6) #after running DBSCAN a few times looks like 6 is optimal
neighbors_fit_pca = neighbors_pca.fit(ypca) #actually calculate
distances_pca, indices = neighbors_fit_pca.kneighbors(ypca) #grab results
#prepare and map
distances_pca = np.sort(distances_pca, axis=0)
distances_pca = distances_pca[:,1]
eps_plot_pca = px.line(distances_pca, title='average distance between each point and its n_neighbors')
eps_plot_pca.show()
#The elbow of the graph shows where diminishing returns set in - in this case 0.72? 0.86?
# #############################################################################
# Compute DBSCAN
ypca_db = DBSCAN(eps=0.79, min_samples=6).fit(ypca)
core_samples_mask_pca = np.zeros_like(ypca_db.labels_, dtype=bool)
core_samples_mask_pca[ypca_db.core_sample_indices_] = True
labels_pca = ypca_db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_pca = len(set(labels_pca)) - (1 if -1 in labels_pca else 0)
n_noise_pca = list(labels_pca).count(-1)
print('Estimated number of clusters: %d' % n_clusters_pca)
print('Estimated number of noise points: %d' % n_noise_pca)
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(ypca, labels_pca))
Estimated number of clusters: 1 Estimated number of noise points: 21 Silhouette Coefficient: 0.399
That doesn't look so good, as we are only getting one cluster that contains most of the points, but the silhouette coefficient looks better than in previous iterations! So: a fine fit, but a bad result.
Let's go forward anyway and see what it looks like.
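A hedged aside on why "fine fit, bad result" can happen (synthetic data, not the post-PCA towns): with one dense cluster plus scattered noise, `silhouette_score` still runs because it treats the -1 noise label as a second cluster, so the coefficient mostly measures how far the noise sits from the cluster rather than cluster-vs-cluster separation.

```python
# One cluster + noise can still yield a decent-looking silhouette score
import numpy as np
from sklearn import metrics

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.3, size=(80, 3))       # one tight cluster
scattered = rng.uniform(low=-5.0, high=5.0, size=(15, 3))  # scattered "noise"
pts = np.vstack([dense, scattered])
lbls = np.array([0] * 80 + [-1] * 15)   # DBSCAN-style labels: cluster 0 + noise

one_cluster_score = metrics.silhouette_score(pts, lbls)
print('Silhouette Coefficient: %0.3f' % one_cluster_score)
```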
#3D plot showing the DBSCAN clusters in the space of the three PCs used
ypca['DBSCAN_Clusters'] = labels_pca
fig_3D_DB = px.scatter_3d(ypca, x='PC1', y='PC2', z='PC3', color ='DBSCAN_Clusters' )
fig_3D_DB.show()
#YELLOW IS CLUSTER, BLUE IS NOISE
This classification makes sense for DBSCAN in the 3D space, as the yellow cluster is dense whereas the noise (blue) is not dense at all. However, I don't think this result is very helpful in determining patterns within the data: we have one cluster holding the majority of the features, plus noise that relates to it only minimally. Since the point of this exercise was to seek similarities and differences between county subdivisions based on demographic data, this result does not fulfill that goal.